Using machine learning to speed up manual image annotation: application to a 3D imaging protocol for measuring single cell gene expression in the developing C. elegans embryo

Author: AE Carpenter
BE Boser
CC Chang
G Lin
JI Murray
John I Murray
M Wang
MR Lamprecht
MS Vokes
R Wollman
RA Russell
Robert H Waterston
S Hamahashi
S Sanei
TJ Boyle
William S Noble
WS Noble
X Chen
Z Bao
Zafer Aydin
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Image analysis is an essential component in many biological experiments that study gene expression, cell cycle progression, and protein localization. A protocol for tracking the expression of individual C. elegans genes was developed that collects image samples of a developing embryo by 3-D time-lapse microscopy. In this protocol, a program called StarryNite automatically recognizes fluorescently labeled cells and traces their lineage. However, because of the noise present in the data and the challenges introduced by the increasing number of cells in later stages of development, this program is not error free. In the current version, error correction (i.e., editing) is performed manually using a graphical interface tool named AceTree, which was developed specifically for this task. For a single experiment, this manual annotation task takes several hours. Results: In this paper, we reduce the time required to correct errors made by StarryNite. We target one of the most frequent error types (movements annotated as divisions) and train a support vector machine (SVM) classifier to decide whether a division call made by StarryNite is correct. We show, via cross-validation experiments on several benchmark data sets, that the SVM successfully identifies this type of error. A new version of StarryNite that includes the trained SVM classifier is available at http://starrynite.sourceforge.net. Conclusions: We demonstrate the utility of a machine learning approach to error annotation for StarryNite. In the process, we also provide some general methodologies for developing and validating a classifier with respect to a given pattern recognition task.

A Regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data

Author: A Enright
A Gavin
A Grigoriev
A Hoerl
AJ Dobson
EG WS Cleveland
G GH
GRG Lanckriet
H Ge
M Deng
M Eisen
M Fellenberg
MPS Brown
O Troyanskaya
P Liang
P Pavlidis
R Overbeek
R Tibshirani
Walter L Ruzzo
WS Noble
Y Zheng
Zizhen Yao
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: As a variety of functional genomic and proteomic techniques become available, there is an increasing need for functional analysis methodologies that integrate heterogeneous data sources. METHODS: In this paper, we address this issue by proposing a general framework for gene function prediction based on the k-nearest-neighbor (KNN) algorithm. The choice of KNN is motivated by its simplicity, its flexibility to incorporate different data types, and its adaptability to irregular feature spaces. A weakness of traditional KNN methods, especially when handling heterogeneous data, is that performance is subject to the often ad hoc choice of similarity metric. To address this weakness, we apply regression methods to infer a similarity metric as a weighted combination of a set of base similarity measures, which helps to locate the neighbors that are most likely to be in the same class as the target gene. We also suggest a novel voting scheme to generate confidence scores that estimate the accuracy of predictions. The method extends gracefully to multi-way classification problems. RESULTS: We apply this technique to gene function prediction according to three well-known Escherichia coli classification schemes suggested by biologists, using information derived from microarray and genome sequencing data. We demonstrate that our algorithm dramatically outperforms the naive KNN methods and is competitive with support vector machine (SVM) algorithms for integrating heterogeneous data. We also show that by combining different data sources, prediction accuracy can improve significantly. CONCLUSION: Our extension of KNN with automatic feature weighting, multi-class prediction, and probabilistic inference enhances prediction accuracy significantly while remaining efficient, intuitive, and flexible. This general framework can also be applied to similar classification problems involving heterogeneous datasets.
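
The weighted-similarity idea in this abstract can be illustrated with a minimal sketch. It assumes the weights have already been obtained (the abstract says they are inferred by regression, which is not shown here), and the gene names, similarity values, class labels, and the confidence formula are all hypothetical choices for illustration rather than the authors' scheme.

```python
from collections import defaultdict

def combined_similarity(base_sims, weights):
    """Weighted sum of the base similarity scores for one gene pair."""
    return sum(w * s for w, s in zip(weights, base_sims))

def knn_predict(target_sims, labels, weights, k=3):
    """Predict a class for a target gene from its k most similar neighbors.

    target_sims: dict gene -> list of base similarities to the target gene
    labels:      dict gene -> known functional class
    weights:     one weight per base similarity (assumed already fitted)
    Returns (predicted_class, confidence); confidence is the share of the
    neighbors' combined similarity that voted for the winning class.
    """
    scored = sorted(
        ((combined_similarity(sims, weights), gene)
         for gene, sims in target_sims.items()),
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for score, gene in scored:
        votes[labels[gene]] += max(score, 0.0)  # ignore negative similarity
    total = sum(votes.values())
    best = max(votes, key=votes.get)
    confidence = votes[best] / total if total > 0 else 0.0
    return best, confidence

# Hypothetical toy data: two base similarities per neighbor gene.
target_sims = {
    "geneA": [0.9, 0.2],
    "geneB": [0.8, 0.7],
    "geneC": [0.1, 0.9],
    "geneD": [0.2, 0.1],
}
labels = {
    "geneA": "metabolism",
    "geneB": "metabolism",
    "geneC": "transport",
    "geneD": "transport",
}
weights = [0.7, 0.3]  # illustrative values, not fitted by regression

prediction, confidence = knn_predict(target_sims, labels, weights, k=3)
print(prediction, round(confidence, 3))
```

Because each base similarity enters only through its weight, adding a new data source in this sketch means adding one more entry to every similarity list and one more weight.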

Public Library of Science (PLOS)

High Resolution Models of Transcription Factor-DNA Affinities Improve In Vitro and In Vivo Binding Predictions

Author: Aaron Arvey
C Kissinger
C Leslie
C Zhu
Christina Leslie
CT Harbison
D Fulton
DE Newburger
E Bolotin
E Fraenkel
G Badis
G Pavesi
MF Berger
O Wallerman
P Kharchenko
Phaedra Agius
R Kuang
S Georgiev
Uwe Ohler
William Chang
William Stafford Noble
WS Noble
X Chen
XS Liu
Publication venue: Public Library of Science
Publication date: 01/01/2010

Accurately modeling the DNA sequence preferences of transcription factors (TFs), and using these models to predict in vivo genomic binding sites for TFs, are key pieces in deciphering the regulatory code. These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices (PSSMs), which may match large numbers of sites and produce an unreliable list of target genes. Recently, protein binding microarray (PBM) experiments have emerged as a new source of high-resolution data on in vitro TF binding specificities. PBM data has been analyzed either by estimating PSSMs or via rank statistics on probe intensities, so that individual sequence patterns are assigned enrichment scores (E-scores). This representation is informative but unwieldy because every TF is assigned a list of thousands of scored sequence patterns. Meanwhile, high-resolution in vivo TF occupancy data from ChIP-seq experiments is also increasingly available. We have developed a flexible discriminative framework for learning TF binding preferences from high-resolution in vitro and in vivo data. We first trained support vector regression (SVR) models on PBM data to learn the mapping from probe sequences to binding intensities. We used a novel k-mer based string kernel called the di-mismatch kernel to represent probe sequence similarities. The SVR models are more compact than E-scores, more expressive than PSSMs, and can be readily used to scan genomic regions to predict in vivo occupancy. Using a large data set of yeast and mouse TFs, we found that our SVR models can better predict probe intensity than the E-score method or PBM-derived PSSMs. Moreover, by using SVRs to score yeast, mouse, and human genomic regions, we were better able to predict genomic occupancy as measured by ChIP-chip and ChIP-seq experiments. Finally, we found that by training kernel-based models directly on ChIP-seq data, we greatly improved in vivo occupancy prediction, and by comparing a TF's in vitro and in vivo models, we could identify cofactors and disambiguate direct and indirect binding.
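
The abstract does not define the di-mismatch kernel, so the sketch below substitutes the simpler, exact-match k-mer spectrum kernel to show what a k-mer based string kernel computes: the inner product of two sequences' k-mer count vectors. The probe sequences and the choice k=3 are hypothetical.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(seq_a, seq_b, k=3):
    """Inner product of the k-mer count vectors of two sequences.

    This is the plain spectrum kernel (exact k-mer matches only), used
    here as a stand-in; it is not the di-mismatch kernel itself.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())

# Hypothetical probe sequences.
probe_1 = "ACGTACGT"
probe_2 = "TACGTTTT"
print(spectrum_kernel(probe_1, probe_2, k=3))
```

A kernel of this kind can be supplied to a support vector regression model as a precomputed similarity matrix, so the sequences never need an explicit fixed-length encoding.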

CiteSeerX

Physicochemical property distributions for accurate and rapid pairwise protein homology detection

Author: A Ben-Hur
A Kumar
AG Murzin
AR Shah
B Liu
BJ Webb-Robertson
Bobbie-Jo M Webb-Robertson
C Leslie
Christopher S Oehmen
CS Leslie
H Rangwala
H Saigo
I Jung
I Melvin
J Weston
Kyle G Ratuiste
L Liao
NH Anderson
QW Dong
R Kuang
S Hochreiter
SF Altschul
T Damoulas
T Lingner
TF Smith
WS Noble
Y Hou
Y Yang
Y Yuan
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: The challenge of remote homology detection is that many evolutionarily related sequences have very little similarity at the amino acid level. Kernel-based discriminative methods, such as support vector machines (SVMs), that use vector representations of sequences derived from sequence properties have been shown to have superior accuracy when compared to traditional approaches for the task of remote homology detection. Results: We introduce a new method for feature vector representation based on the physicochemical properties of the primary protein sequence. A distribution of physicochemical property scores is assembled from 4-mers of the sequence and normalized based on the null distribution of the property over all possible 4-mers. With this approach there is little computational cost associated with the transformation of the protein into feature space, and overall performance in terms of remote homology detection is comparable with current state-of-the-art methods. We demonstrate that the features can be used for the task of pairwise remote homology detection with improved accuracy versus sequence-based methods such as BLAST and other feature-based methods of similar computational cost. Conclusions: A protein feature method based on physicochemical properties is a viable approach for extracting features in a computationally inexpensive manner while retaining the sensitivity of SVM protein homology detection. Furthermore, identifying features that can be used for generic pairwise homology detection in lieu of family-based homology detection is important for applications such as large database searches and comparative genomics.

Public Library of Science (PLOS)

Predicting mental imagery based BCI performance from personality, cognitive profile and neurophysiological patterns

Mental-Imagery based Brain-Computer Interfaces (MI-BCIs) allow their users to send commands to a computer using their brain activity alone (typically measured by electroencephalography, EEG), which is processed while they perform specific mental tasks. While very promising, MI-BCIs remain barely used outside laboratories because users find them difficult to control. Indeed, although some users obtain good control performance after training, a substantial proportion remains unable to reliably control an MI-BCI. This huge variability in user performance led the community to look for predictors of MI-BCI control ability. However, these predictors were only explored for motor-imagery based BCIs, and mostly for a single training session per subject. In this study, 18 participants were instructed to learn to control an EEG-based MI-BCI by performing 3 MI tasks, 2 of which were non-motor tasks, across 6 training sessions on 6 different days. Relationships between the participants’ BCI control performance and their personality, cognitive profile, and neurophysiological markers were explored. While no relevant relationships with neurophysiological markers were found, strong correlations between MI-BCI performance and mental-rotation scores (reflecting spatial abilities) were revealed. Also, a predictive model of MI-BCI performance based on psychometric questionnaire scores was proposed. A leave-one-subject-out cross-validation process revealed the stability and reliability of this model: it predicted participants’ performance with a mean error of less than 3 points. This study determined how users’ profiles impact their MI-BCI control ability and thus clears the way for designing novel MI-BCI training protocols adapted to the profile of each user.
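
The leave-one-subject-out procedure mentioned in this abstract can be sketched as follows. The example assumes a one-predictor least-squares line as the model and uses invented scores; the study's actual model, predictors, and data are not given in the abstract.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the predictor is not constant
    return slope, mean_y - slope * mean_x

def loso_mean_abs_error(xs, ys):
    """Leave-one-subject-out cross-validation.

    Each subject is held out in turn, the model is fitted on the
    remaining subjects, and the held-out subject's score is predicted.
    Returns the mean absolute prediction error.
    """
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        errors.append(abs(slope * xs[i] + intercept - ys[i]))
    return sum(errors) / len(errors)

# Hypothetical data: one questionnaire score and one BCI performance
# score (percent correct) per subject.
questionnaire = [10, 14, 18, 22, 26, 30]
performance = [52, 55, 61, 63, 70, 71]
print(round(loso_mean_abs_error(questionnaire, performance), 2))
```

Holding out a whole subject, rather than individual trials, keeps every prediction free of data from the person being predicted.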

INRIA a CCSD electronic archive server

Sussex Research Online

Functional SNP allele discovery (fSNPd): an approach to find highly penetrant, environmental-triggered genotypes underlying complex human phenotypes

Author: C. Geoffrey Woods
David Menon
EL Kwak
G Gibson
JL Haines
Kaitlin Stouffer
MD DH
Michael Lee
Michael Nahorski
MW Foster
Nivedita Sarveswaran
Pablo Moreno
PM Visscher
RA Wilke
S Sawcer
TR Prezant
VM Ingram
WH Chung
WS Noble
Publication venue: Springer Science and Business Media LLC

Public Library of Science (PLOS)

Rational Design of Temperature-Sensitive Alleles Using Computational Structure Prediction

Author: B Cunningham
B Lee
C Cortes
Ca Rohl
Christopher S. Poultney
CJ Burges
David Gresham
Dennis E. Shasha
EH Kellogg
G Chakshusmathi
Glenn L. Butterfoss
HM Muller
JM Word
JR Quinlan
K Bajaj
K Drew
KD Pruitt
Kevin Drew
Kristin C. Gunsalus
M Hall
Michelle R. Gutwein
N Eswar
N Siew
R Varadarajan
Richard Bonneau
RJ Dohmen
S Tweedie
SF Altschul
TW Harris
Vladimir N. Uversky
WS Noble
WS Sandberg
Publication venue: Public Library of Science
Publication date: 02/09/2011

Temperature-sensitive (ts) mutations are mutations that exhibit a mutant phenotype at high or low temperatures and a wild-type phenotype at normal temperature. Temperature-sensitive mutants are valuable tools for geneticists, particularly in the study of essential genes. However, finding ts mutations typically relies on generating and screening many thousands of mutations, which is an expensive and labor-intensive process. Here we describe an in silico method that uses Rosetta and machine learning techniques to predict a highly accurate “top 5” list of ts mutations given the structure of a protein of interest. Rosetta is a protein structure prediction and design code, used here to model and score how proteins accommodate point mutations with side-chain and backbone movements. We show that integrating Rosetta relax-derived features with sequence-based features results in accurate temperature-sensitive mutation predictions.

BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features

Author: AP Bradley
AR Panchenko
C Yan
Caiyan Huang
CH Wu
DE Draper
E Bechara
IB Kuznetsov
JA Swets
Jack Y Yang
JC Darnell
L Wang
Liangjiang Wang
M Terribilini
Mary Qu Yang
P Baldi
S Ahmad
S Hwang
S Jones
SF Altschul
T Joachims
WS Noble
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Understanding how biomolecules interact is a major task of systems biology. To model protein-nucleic acid interactions, it is important to identify the DNA- or RNA-binding residues in proteins. Protein sequence features, including the biochemical properties of amino acids and evolutionary information in the form of a position-specific scoring matrix (PSSM), have been used for DNA- or RNA-binding site prediction. However, the PSSM is designed for PSI-BLAST searches, and it may not contain all the evolutionary information needed for modelling DNA- or RNA-binding sites in protein sequences. Results: In the present study, several new descriptors of evolutionary information have been developed and evaluated for sequence-based prediction of DNA- and RNA-binding residues using support vector machines (SVMs). The new descriptors were shown to improve classifier performance. Interestingly, the best classifiers were obtained by combining the new descriptors and PSSM, suggesting that they captured different aspects of evolutionary information for DNA- and RNA-binding site prediction. The SVM classifiers achieved 77.3% sensitivity and 79.3% specificity for prediction of DNA-binding residues, and 71.6% sensitivity and 78.7% specificity for RNA-binding site prediction. Conclusions: Predictions at this level of accuracy may provide useful information for modelling protein-nucleic acid interactions in systems biology studies. We have thus developed a web-based tool called BindN+ (http://bioinfo.ggc.org/bindn+/) to make the SVM classifiers accessible to the research community.
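
For readers unfamiliar with the two metrics reported above, the following minimal sketch computes sensitivity and specificity from per-residue labels. The label vectors are invented for illustration and do not reproduce the paper's figures.

```python
def sensitivity_specificity(true_labels, predicted_labels):
    """Return (sensitivity, specificity) for binary labels (1 = binding).

    sensitivity = TP / (TP + FN): share of true binding residues found
    specificity = TN / (TN + FP): share of non-binding residues rejected
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical per-residue annotations for a 10-residue stretch.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
sens, spec = sensitivity_specificity(truth, predicted)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

Reporting both values matters here because binding residues are usually a small minority, so a classifier that labels everything as non-binding would have high accuracy but zero sensitivity.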

IUPUIScholarWorks

eScholarship - University of California

Support vector machine versus logistic regression modeling for prediction of hospital mortality in critically ill patients with haematological malignancies

Author: A Chu
A Vinayagam
B Boser
B Giraldo
B Schölkopf
C Cortes
D Benoit
D Hand
D Hosmer
DD Benoit
E Byvatov
ER DeLong
F De Turck
I Guyon
J Decruyenaere
JE Zimmerman
JS Groeger
L Ohno-Machado
M Soares
P Depuydt
S Lemeshow
S Van Looy
S Vansteelandt
SL Zeger
T Verplancke
WS Noble
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Several models for mortality prediction have been constructed for critically ill patients with haematological malignancies in recent years. These models have proven to be equally or more accurate in predicting hospital mortality in patients with haematological malignancies than ICU severity of illness scores such as the APACHE II or SAPS II [1]. The objective of this study is to compare the accuracy of models based on multiple logistic regression (MLR) and support vector machine (SVM) based models in predicting hospital mortality in patients with haematological malignancies admitted to the ICU. Methods: 352 patients with haematological malignancies admitted to the ICU between 1997 and 2006 for a life-threatening complication were included. 252 patient records were used for training of the models and 100 were used for validation. In a first model, 12 input variables were included for comparison between MLR and SVM. In a second, more complex model, 17 input variables were used. MLR and SVM analyses were performed independently of each other. Discrimination was evaluated using the area under the receiver operating characteristic (ROC) curves (+/- SE). Results: The areas under the ROC curve for the MLR and SVM in the validation data set were 0.768 (+/- 0.04) vs. 0.802 (+/- 0.04) in the first model (p = 0.19) and 0.781 (+/- 0.05) vs. 0.808 (+/- 0.04) in the second, more complex model (p = 0.44). SVM needed only 4 variables to make its prediction in both models, whereas MLR needed 7 and 8 variables in the first and second model respectively. Conclusion: The discriminative power of both the MLR and SVM models was good. No statistically significant differences were found in discriminative power between MLR and SVM for prediction of hospital mortality in critically ill patients with haematological malignancies.
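
The area under the ROC curve used to compare the two models can be computed directly from predicted risks and outcomes. The sketch below uses the pairwise (Mann-Whitney) formulation, in which the AUC is the probability that a randomly chosen patient who died received a higher predicted risk than a randomly chosen survivor; the outcomes and risk scores are invented for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class (here, hospital death), 0 otherwise
    scores: predicted risk for each patient
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical validation set: outcomes and one model's predicted risks.
outcomes = [1, 1, 0, 1, 0, 0]
risks = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(roc_auc(outcomes, risks), 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported values of roughly 0.77 to 0.81 should be read.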

Ghent University Academic Bibliography